AllLife Bank is a US bank with a growing customer base. Most of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank wants to expand this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, management wants to explore ways of converting its liability customers to personal loan customers while retaining them as depositors. A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio. As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
Objective
1. To predict whether a liability customer will buy a personal loan or not.
2. To identify which variables are most significant.
3. To identify which segment of customers should be targeted more.
Data Dictionary
Best Practices for Notebook:
• The notebook should be well-documented, with inline comments explaining the functionality of code and markdown cells containing comments on the observations and insights.
• The notebook should be run from start to finish in a sequential manner before submission.
• It is preferable to remove all warnings and errors before submission.
• The notebook should be submitted as an HTML file (.html) and as a notebook file (.ipynb).
Submission Guidelines:
1. There are two parts to the submission:
   1. A well-commented Jupyter notebook [format - .ipynb]
   2. File converted to HTML format
2. Any assignment found copied/plagiarized from other groups will not be graded and will be awarded zero marks.
3. Please ensure timely submission, as any submission post-deadline will not be accepted for evaluation.
4. Submission will not be evaluated if
   1. it is submitted post-deadline, or
   2. more than 2 files are submitted.
# this will help in making the Python code more structured automatically (good coding practice)
## %load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,  # removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay instead
    precision_recall_curve,
    roc_curve,
)
## Read the dataset
loans = pd.read_csv("Loan_Modelling.csv")
# copying data to another variable to avoid any changes to the original data
data = loans.copy()
data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.shape
(5000, 14)
# Check for duplicate rows; any that are found should be removed
data[data.duplicated()].count()
# No duplicate rows are found
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
dtype: int64
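The duplicate check above counts non-null values per column among the duplicated rows. As a minimal sketch on a toy frame (the values are hypothetical, not taken from the dataset), the same question can be answered with a single number, and the removal step can be written explicitly:

```python
import pandas as pd

# Toy frame (hypothetical values) in which the third row repeats the first
toy = pd.DataFrame(
    {"Age": [25, 45, 25], "Income": [49, 34, 49], "Family": [4, 3, 4]}
)

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```

On the loan data the count is zero, so no rows need to be dropped.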
## Check the data types of the columns for the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | ID | 5000 non-null | int64 |
| 1 | Age | 5000 non-null | int64 |
| 2 | Experience | 5000 non-null | int64 |
| 3 | Income | 5000 non-null | int64 |
| 4 | ZIPCode | 5000 non-null | int64 |
| 5 | Family | 5000 non-null | int64 |
| 6 | CCAvg | 5000 non-null | float64 |
| 7 | Education | 5000 non-null | int64 |
| 8 | Mortgage | 5000 non-null | int64 |
| 9 | Personal_Loan | 5000 non-null | int64 |
| 10 | Securities_Account | 5000 non-null | int64 |
| 11 | CD_Account | 5000 non-null | int64 |
| 12 | Online | 5000 non-null | int64 |
| 13 | CreditCard | 5000 non-null | int64 |
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
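## New variable "BuyPersonal_Loan", used to predict whether the customer will buy a personal loan or not
## It is True when the customer holds any of: personal loan, securities account, CD account, credit card
data["BuyPersonal_Loan"] = (
    (data["Personal_Loan"] > 0)
    | (data["Securities_Account"] > 0)
    | (data["CD_Account"] > 0)
    | (data["CreditCard"] > 0)
)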
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | False |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | True |
data.BuyPersonal_Loan.value_counts()
False 2857
True 2143
Name: BuyPersonal_Loan, dtype: int64
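The flag is built by OR-ing four product columns, so it can be True for customers who never took a personal loan. A small sketch, using the product columns of the first five rows shown by `data.head()` above, makes this visible; the `any(axis=1)` formulation is an equivalent alternative introduced here for illustration, not the notebook's own code:

```python
import pandas as pd

# Product columns of the first five rows shown by data.head() above
head = pd.DataFrame(
    {
        "Personal_Loan": [0, 0, 0, 0, 0],
        "Securities_Account": [1, 1, 0, 0, 0],
        "CD_Account": [0, 0, 0, 0, 0],
        "CreditCard": [0, 0, 0, 0, 1],
    }
)

product_cols = ["Personal_Loan", "Securities_Account", "CD_Account", "CreditCard"]

# Equivalent to chaining the four "> 0" comparisons with "|"
combined_flag = head[product_cols].gt(0).any(axis=1)

print(combined_flag.tolist())
print(int(head["Personal_Loan"].sum()))
```

Three of these five customers are flagged although none of them holds a personal loan, which is consistent with the flag being True for 2143 customers while the campaign conversion rate was just over 9%. If the modelling target is meant to be loan purchase alone, `Personal_Loan` itself may be the more direct choice.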
## Checking for missing values
data.isnull().sum()
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
BuyPersonal_Loan 0
dtype: int64
data_1 = data.copy()
## data_1["Postal_Code"]=data_1["ZIPCode"]
##data_1.drop("ZIP_City", axis=1, inplace=True)
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | False |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | True |
! pip install uszipcode
from uszipcode import SearchEngine, SimpleZipcode, Zipcode
search = SearchEngine(simple_zipcode=False)
Requirement already satisfied: uszipcode in d:\programs\anaconda3\lib\site-packages (0.2.6)
## Functions for returning the city and the state based on ZIP code lookup
def zco_city(x):
    res = search.by_zipcode(x)
    return res.major_city


def zco_state(x):
    res = search.by_zipcode(x)
    return res.state
## Creating a new column ZIP_City to hold the city name of the ZIP code
data_1["ZIP_City"]=''
# data_1["ZIP_City"] = data_1["ZIPCode"].apply(zco)
data_1["ZIP_City"]= data_1["ZIPCode"].transform(zco_city)
## Creating a new column ZIP_State as a categorical variable
## Note: the call below uses zco_city, so ZIP_State ends up holding city names (visible in the outputs that follow); zco_state would return the state instead
data_1["ZIP_State"]=''
# data_1["ZIP_City"] = data_1["ZIPCode"].apply(zco)
data_1["ZIP_State"]= data_1["ZIPCode"].transform(zco_city)
## Drop the ZIPCode column (it is numerical and needs to be categorical) and the ID column
data_1.drop(["ZIPCode","ID"], axis=1, inplace=True)
data_1.tail()
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan | ZIP_City | ZIP_State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 29 | 3 | 40 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | False | Irvine | Irvine |
| 4996 | 30 | 4 | 15 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | False | La Jolla | La Jolla |
| 4997 | 63 | 39 | 24 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | False | Ojai | Ojai |
| 4998 | 65 | 40 | 49 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | False | Los Angeles | Los Angeles |
| 4999 | 28 | 4 | 83 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | True | Irvine | Irvine |
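The lookup above queries the ZIP database once per row. A minimal sketch of an alternative is to resolve each distinct ZIP code once and then broadcast the result with `map`. Here `lookup_city` and `KNOWN_CITIES` are hypothetical stand-ins for `search.by_zipcode(x).major_city` so that the example runs without the `uszipcode` package; the ZIP-to-city pairs are taken from the head and tail outputs in this notebook, and 99999 is a made-up code that mimics a failed lookup:

```python
import pandas as pd

# Hypothetical stand-in for search.by_zipcode(...).major_city
KNOWN_CITIES = {
    91107: "Pasadena",
    90089: "Los Angeles",
    94720: "Berkeley",
    92697: "Irvine",
    92037: "La Jolla",
}


def lookup_city(zipcode):
    """Return the city for a ZIP code, or None when the code is unknown."""
    return KNOWN_CITIES.get(zipcode)


# 99999 is a made-up ZIP code used to mimic a lookup that finds nothing
zips = pd.Series([91107, 90089, 94720, 92697, 92037, 92697, 99999], name="ZIPCode")

# Resolve every distinct ZIP code once, then map the results back onto the rows
city_by_zip = {int(z): lookup_city(int(z)) for z in zips.unique()}
cities = zips.map(city_by_zip)

print(cities.tolist())
print(int(cities.isnull().sum()))
```

Unresolved codes come back as missing values, which matches the pattern seen in the notebook, where some ZIP codes yield no city.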
data_1.info()
data_1.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Age | 5000 non-null | int64 |
| 1 | Experience | 5000 non-null | int64 |
| 2 | Income | 5000 non-null | int64 |
| 3 | Family | 5000 non-null | int64 |
| 4 | CCAvg | 5000 non-null | float64 |
| 5 | Education | 5000 non-null | int64 |
| 6 | Mortgage | 5000 non-null | int64 |
| 7 | Personal_Loan | 5000 non-null | int64 |
| 8 | Securities_Account | 5000 non-null | int64 |
| 9 | CD_Account | 5000 non-null | int64 |
| 10 | Online | 5000 non-null | int64 |
| 11 | CreditCard | 5000 non-null | int64 |
| 12 | BuyPersonal_Loan | 5000 non-null | bool |
| 13 | ZIP_City | 4966 non-null | object |
| 14 | ZIP_State | 4966 non-null | object |
dtypes: bool(1), float64(1), int64(11), object(2)
memory usage: 551.9+ KB
Age 0
Experience 0
Income 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
BuyPersonal_Loan 0
ZIP_City 34
ZIP_State 34
dtype: int64
### Checking for NULLS in the ZIP_City
data_1[data_1.ZIP_City.isnull()]
data_1[data_1.ZIP_City.isnull()].count()
Age 34
Experience 34
Income 34
Family 34
CCAvg 34
Education 34
Mortgage 34
Personal_Loan 34
Securities_Account 34
CD_Account 34
Online 34
CreditCard 34
BuyPersonal_Loan 34
ZIP_City 0
ZIP_State 0
dtype: int64
data_1.dropna(subset=["ZIP_City"], inplace=True)
# let us reset the dataframe index
data_1.reset_index(inplace=True, drop=True)
data=data_1.copy()
data.shape
(4966, 15)
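The shape confirms the arithmetic: 5000 rows minus the 34 rows without a city leaves 4966. As a small sketch on a toy frame (hypothetical values), the same drop-and-reindex step can be chained without `inplace=True`:

```python
import pandas as pd

# Toy frame (hypothetical values): two of four rows have no city
toy = pd.DataFrame(
    {"Age": [29, 30, 63, 65], "ZIP_City": ["Irvine", None, "Ojai", None]}
)

# Drop rows with a missing city, then renumber the index from 0
cleaned = toy.dropna(subset=["ZIP_City"]).reset_index(drop=True)

print(cleaned.shape)
print(cleaned.index.tolist())
```

Resetting the index matters for later positional operations; without it the index would keep gaps where rows were removed.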
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4966 entries, 0 to 4965
Data columns (total 15 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Age | 4966 non-null | int64 |
| 1 | Experience | 4966 non-null | int64 |
| 2 | Income | 4966 non-null | int64 |
| 3 | Family | 4966 non-null | int64 |
| 4 | CCAvg | 4966 non-null | float64 |
| 5 | Education | 4966 non-null | int64 |
| 6 | Mortgage | 4966 non-null | int64 |
| 7 | Personal_Loan | 4966 non-null | int64 |
| 8 | Securities_Account | 4966 non-null | int64 |
| 9 | CD_Account | 4966 non-null | int64 |
| 10 | Online | 4966 non-null | int64 |
| 11 | CreditCard | 4966 non-null | int64 |
| 12 | BuyPersonal_Loan | 4966 non-null | bool |
| 13 | ZIP_City | 4966 non-null | object |
| 14 | ZIP_State | 4966 non-null | object |
dtypes: bool(1), float64(1), int64(11), object(2)
memory usage: 548.1+ KB
## Since no city could be found for 34 of the ZIP codes, those rows were dropped above
# Summary statistics
data.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 4966 | NaN | NaN | NaN | 45.3538 | 11.4628 | 23 | 35 | 45 | 55 | 67 |
| Experience | 4966 | NaN | NaN | NaN | 20.1198 | 11.4679 | -3 | 10 | 20 | 30 | 43 |
| Income | 4966 | NaN | NaN | NaN | 73.8278 | 46.0423 | 8 | 39 | 64 | 98 | 224 |
| Family | 4966 | NaN | NaN | NaN | 2.3971 | 1.14761 | 1 | 1 | 2 | 3 | 4 |
| CCAvg | 4966 | NaN | NaN | NaN | 1.93702 | 1.74393 | 0 | 0.7 | 1.5 | 2.5 | 10 |
| Education | 4966 | NaN | NaN | NaN | 1.88039 | 0.840197 | 1 | 1 | 2 | 3 | 3 |
| Mortgage | 4966 | NaN | NaN | NaN | 56.6687 | 101.865 | 0 | 0 | 0 | 101 | 635 |
| Personal_Loan | 4966 | NaN | NaN | NaN | 0.0960532 | 0.294694 | 0 | 0 | 0 | 0 | 1 |
| Securities_Account | 4966 | NaN | NaN | NaN | 0.104108 | 0.305431 | 0 | 0 | 0 | 0 | 1 |
| CD_Account | 4966 | NaN | NaN | NaN | 0.0608135 | 0.239012 | 0 | 0 | 0 | 0 | 1 |
| Online | 4966 | NaN | NaN | NaN | 0.596859 | 0.490578 | 0 | 0 | 1 | 1 | 1 |
| CreditCard | 4966 | NaN | NaN | NaN | 0.293596 | 0.455455 | 0 | 0 | 0 | 1 | 1 |
| BuyPersonal_Loan | 4966 | 2 | False | 2841 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ZIP_City | 4966 | 244 | Los Angeles | 375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ZIP_State | 4966 | 244 | Los Angeles | 375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
labeled_barplot(data, "Age", perc=True)
histogram_boxplot(data, "Age", bins=70)
#### Observations on Experience
labeled_barplot(data, "Experience", perc=True)
histogram_boxplot(data, "Experience", bins=70)
labeled_barplot(data, "Income", perc=True)
histogram_boxplot(data, "Income", bins=70)
# Creating a new column with the transformed variable
data["Income_log"] = np.log(data["Income"])
histogram_boxplot(data, "Income_log", kde=True)
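The log transform works for Income because every income in the data is positive (the minimum in the summary table is 8). The commented-out `Mortgage_log` lines later in the notebook suggest a similar transform was considered for Mortgage, where most values are zero and `np.log(0)` is negative infinity. The sketch below uses the Income values of the first five rows shown earlier and illustrative Mortgage values; `np.log1p` is shown as one common alternative for zero-heavy columns, as an assumption rather than the notebook's method:

```python
import numpy as np

# Income of the first five rows shown by data.head()
income = np.array([49, 34, 11, 100, 45])

# Illustrative mortgage values; the real column contains many zeros
mortgage = np.array([0, 0, 0, 101, 635])

income_log = np.log(income)          # safe: all incomes are positive
mortgage_log1p = np.log1p(mortgage)  # log(1 + x) stays finite at zero

print(np.round(income_log, 6))
print(bool(np.isfinite(mortgage_log1p).all()))
```

The first value, log(49), agrees with the `Income_log` entry of 3.891820 that appears for the same customer further down.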
labeled_barplot(data, "Education", perc=True)
histogram_boxplot(data, "Mortgage", bins=10)
labeled_barplot(data, "Personal_Loan", perc=True)
histogram_boxplot(data, "Personal_Loan", bins=70)
- More than 90% of the customers did not accept a personal loan.
labeled_barplot(data, "Securities_Account", perc=True)
histogram_boxplot(data, "Securities_Account", bins=70)
- More than 89% of the customers do not have a securities account.
labeled_barplot(data, "CD_Account", perc=True)
- More than 93% of the customers do not have a CD account.
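The percentages in these observations can also be read off numerically instead of from the bar labels. A minimal sketch on a toy 0/1 column (hypothetical values) shows two equivalent ways to get the share of zeros:

```python
import pandas as pd

# Toy 0/1 column (hypothetical values): 1 customer accepted, 9 did not
accepted = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], name="Personal_Loan")

# Fraction of rows in each class
shares = accepted.value_counts(normalize=True)

# For a 0/1 column the mean is the share of ones, so 1 - mean is the share of zeros
share_of_zeros = 1 - accepted.mean()

print(round(shares.loc[0] * 100, 1))
print(round(share_of_zeros * 100, 1))
```

Applied to the means in the summary table, 1 - 0.0960532 gives about 90.4% for Personal_Loan, 1 - 0.104108 about 89.6% for Securities_Account, and 1 - 0.0608135 about 93.9% for CD_Account.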
labeled_barplot(data, "Online", perc=True)
data[data.Income_log.isnull()].count()
## data.Mortgage_log.fillna(0,inplace=True)
data.Income_log.value_counts()
3.784190 85 3.637586 84 4.394449 82 3.713572 82 3.663562 80 3.688879 78 3.737670 76 4.418841 74 3.761200 70 3.806662 69 3.367296 67 4.442651 65 3.091042 65 3.218876 64 3.555348 63 4.430817 63 3.401197 63 3.332205 62 3.044522 62 4.356709 61 4.007333 61 4.406719 61 4.158883 60 4.174387 60 3.465736 58 4.382027 56 4.110874 56 4.060443 55 4.127134 55 3.970292 55 3.433987 54 4.369448 53 2.890372 53 3.135494 53 4.077537 53 2.944439 52 3.526361 52 3.988984 52 3.496508 51 3.891820 51 4.094345 51 4.317488 47 2.995732 47 4.248495 47 3.178054 46 3.951244 46 4.234107 45 4.304065 45 3.912023 45 4.290459 44 4.143135 44 3.871201 44 4.262680 43 3.931826 41 4.276666 41 4.499810 38 4.532599 37 4.510860 37 4.219508 35 4.727388 34 4.488636 34 2.708050 33 2.639057 31 2.564949 31 4.736198 30 2.484907 30 4.521789 29 4.584967 28 2.397895 27 4.477337 26 4.718499 26 2.197225 25 4.553877 25 4.543295 25 4.744932 25 4.804021 24 4.948760 24 4.595120 24 4.852030 24 2.302585 23 4.976734 23 2.079442 23 4.615121 23 4.828314 23 4.859812 23 4.709530 22 5.036953 21 4.644391 20 4.795791 20 4.653960 20 5.003946 20 4.897840 20 4.700480 19 4.941642 19 4.875197 19 4.882802 18 4.867534 18 4.634729 18 5.062595 18 5.192957 18 4.927254 18 4.770685 18 4.905275 18 4.691348 18 4.779123 18 4.812184 18 5.187386 17 4.787492 17 5.043425 17 4.934474 16 5.081404 16 4.624973 16 5.273000 15 4.890349 15 5.023881 15 4.955827 15 4.682131 15 5.252273 13 5.099866 13 5.153292 13 5.204007 13 5.209486 12 5.135798 12 5.164786 12 5.214936 12 5.075174 12 4.820282 12 5.030438 11 5.247024 11 5.010635 11 4.997212 11 5.147494 11 5.105945 11 5.087596 10 5.181784 10 4.605170 10 5.236442 10 5.093750 9 4.962845 9 5.159055 9 5.220356 9 5.141664 8 5.267858 8 5.198497 8 5.123964 8 5.129899 7 5.068904 7 4.969813 7 5.257495 6 5.262690 6 5.303305 5 5.017280 4 5.298317 3 5.318120 3 5.293305 3 5.288267 3 5.313206 2 5.241747 2 5.308268 2 5.323010 2 5.384495 1 5.411646 1 Name: Income_log, dtype: int64
## data.Mortgage_log.fillna(0,inplace=True)
##data.Mortgage_log.value_counts()
## Drop ZIPCode column as it is numerical and needs to be categorical
##data.drop(["Mortgage_log"], axis=1, inplace=True)
data.tail()
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan | ZIP_City | ZIP_State | Income_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4961 | 29 | 3 | 40 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | False | Irvine | Irvine | 3.688879 |
| 4962 | 30 | 4 | 15 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | False | La Jolla | La Jolla | 2.708050 |
| 4963 | 63 | 39 | 24 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | False | Ojai | Ojai | 3.178054 |
| 4964 | 65 | 40 | 49 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | False | Los Angeles | Los Angeles | 3.891820 |
| 4965 | 28 | 4 | 83 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | True | Irvine | Irvine | 4.418841 |
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
sns.pairplot(data=data, hue="BuyPersonal_Loan")
plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral")
plt.show()
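The heatmap calls `data.corr()` on a frame that also contains text columns such as `ZIP_City`. Older pandas versions skip such columns silently, while recent releases (2.0 and later, to my understanding) raise an error instead. A version-independent sketch is to select the numeric and boolean columns explicitly before computing correlations; the toy values come from the first five rows shown earlier:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Income": [49, 34, 11, 100, 45],
        "CCAvg": [1.6, 1.5, 1.0, 2.7, 1.0],
        "BuyPersonal_Loan": [True, True, False, False, True],
        "ZIP_City": ["Pasadena", "Los Angeles", "Berkeley", "San Francisco", "Northridge"],
    }
)

# Keep numeric and boolean columns; the text column ZIP_City is left out
numeric_part = toy.select_dtypes(include=["number", "bool"])
corr = numeric_part.corr()

print(corr.columns.tolist())
```

The resulting matrix can be passed to `sns.heatmap` in place of `data.corr()`.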
stacked_barplot(data, "Family","Personal_Loan")
| Family | Personal_Loan = 0 | Personal_Loan = 1 | All |
|---|---|---|---|
| All | 4489 | 477 | 4966 |
| 4 | 1082 | 133 | 1215 |
| 3 | 870 | 132 | 1002 |
| 1 | 1354 | 106 | 1460 |
| 2 | 1183 | 106 | 1289 |
------------------------------------------------------------------------------------------------------------------------
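The raw counts are easier to compare as conversion rates per family size, which is what the `normalize="index"` crosstab inside `stacked_barplot` plots. A short sketch that recomputes those rates directly from the counts printed above:

```python
# Counts copied from the crosstab above: family size -> (no loan, loan)
counts = {1: (1354, 106), 2: (1183, 106), 3: (870, 132), 4: (1082, 133)}

# Share of customers in each family size who hold a personal loan
rates = {
    family: loan / (no_loan + loan)
    for family, (no_loan, loan) in counts.items()
}

# Print from highest to lowest rate
for family, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"Family {family}: {rate:.1%}")
```

Family sizes 3 and 4 come out at roughly 13.2% and 10.9%, against about 8.2% and 7.3% for sizes 2 and 1, so larger families hold personal loans at a visibly higher rate in this sample.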
data.ZIP_City.value_counts()
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
...
Tahoe City 1
Sierra Madre 1
Ladera Ranch 1
Sausalito 1
Stinson Beach 1
Name: ZIP_City, Length: 244, dtype: int64
# sort into a new frame rather than sorting a slice in place (avoids SettingWithCopyWarning)
df_CITY = data[["ZIP_City","Personal_Loan","CD_Account","CreditCard","Education","Mortgage"]].sort_values('Mortgage', ascending=False)
## df_CITY.groupby("ZIP_City")["Mortgage"].head(10)
##df_CITY.groupby("ZIP_City")["Mortgage"].mean()
## df_CITY.groupby("ZIP_City")["Mortgage"].mean().nlargest()
df_CITY
## df_CITYL
| | ZIP_City | Personal_Loan | CD_Account | CreditCard | Education | Mortgage |
|---|---|---|---|---|---|---|
| 2908 | Montclair | 0 | 0 | 0 | 1 | 635 |
| 300 | West Sacramento | 1 | 0 | 0 | 1 | 617 |
| 4778 | San Diego | 1 | 0 | 0 | 3 | 612 |
| 1764 | Berkeley | 0 | 0 | 0 | 1 | 601 |
| 4808 | Hopland | 1 | 0 | 0 | 2 | 590 |
| ... | ... | ... | ... | ... | ... | ... |
| 1956 | Davis | 0 | 0 | 0 | 1 | 0 |
| 1957 | Palo Alto | 0 | 0 | 1 | 3 | 0 |
| 1958 | Oakland | 0 | 0 | 0 | 3 | 0 |
| 1959 | Northridge | 0 | 0 | 0 | 1 | 0 |
| 4965 | Irvine | 0 | 0 | 1 | 1 | 0 |
4966 rows × 6 columns
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=df_CITY, y="Mortgage", x="ZIP_City")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=df_CITY, y="Mortgage", x="ZIP_City")
plt.xticks(rotation=90)
plt.show()
## stacked_barplot(df_CITY, "Mortgage", "ZIP_City")
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=data, y="Income", x="BuyPersonal_Loan")
sns.barplot(data=data, y="Income_log", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=data, y="Income", x="BuyPersonal_Loan")
sns.boxplot(data=data, y="Income_log", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=data, y="Income", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=data, y="Income", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.show()
###### Observation on BuyPersonal_Loan versus Age
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=data, y="Age", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=data, y="Age", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.show()
###### Observation on BuyPersonal_Loan versus Mortgage
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=data, y="Mortgage", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=data, y="Mortgage", x="BuyPersonal_Loan")
plt.xticks(rotation=90)
plt.show()
Prepare the data for analysis: missing value treatment, outlier detection (treat if needed), feature engineering, preparing the data for modelling, and checking the split
data["ZIP_State"].value_counts()
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
...
Tahoe City 1
Sierra Madre 1
Ladera Ranch 1
Sausalito 1
Stinson Beach 1
Name: ZIP_State, Length: 244, dtype: int64
# dropping ZIP_State: it duplicates ZIP_City (see the value counts above), and the data pertains to CA only
data = data.drop(["ZIP_State"], axis=1)
data.head()
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan | ZIP_City | Income_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | True | Pasadena | 3.891820 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | True | Los Angeles | 3.526361 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | False | Berkeley | 2.397895 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | False | San Francisco | 4.605170 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | True | Northridge | 3.806662 |
## SKIPPING: converting the target variable to a 0 or 1 value
## data["BuyPersonal_Loan"] = data["BuyPersonal_Loan"].apply(lambda x: 1 if x == True else 0)
data.BuyPersonal_Loan.value_counts()
False 2841
True 2125
Name: BuyPersonal_Loan, dtype: int64
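The conversion to 0/1 is skipped here and the target stays boolean. Should an integer target be wanted later, a minimal sketch of the vectorised form next to the commented-out lambda looks like this (the five values are the flags of the first five rows shown above):

```python
import pandas as pd

flag = pd.Series([True, True, False, False, True], name="BuyPersonal_Loan")

# Vectorised conversion of booleans to integers
as_int = flag.astype(int)

# Same logic as the commented-out lambda in the notebook
via_lambda = flag.apply(lambda x: 1 if x == True else 0)

print(as_int.tolist())
print(bool((as_int == via_lambda).all()))
```

Both forms give the same values; `astype(int)` avoids a Python-level call per row.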
# creating dummy variables and updating the data frame
dummy_data = pd.get_dummies(
    data,
    columns=["ZIP_City"],
    drop_first=True,  ## drop_first reduces multicollinearity
)
dummy_data.head()
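To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame with three cities (the values are illustrative). The first category in sorted order is dropped and becomes the implicit baseline:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Income": [49, 34, 11, 100],
        "ZIP_City": ["Pasadena", "Los Angeles", "Berkeley", "Los Angeles"],
    }
)

# Three distinct cities produce two indicator columns; "Berkeley" is the baseline
encoded = pd.get_dummies(toy, columns=["ZIP_City"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

With 244 distinct cities in this dataset, the same call yields 243 indicator columns, which is why the frame becomes very wide after this step.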
data=dummy_data.copy()
data.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | BuyPersonal_Loan | Income_log | ZIP_City_Alameda | ZIP_City_Alamo | ZIP_City_Albany | ZIP_City_Alhambra | ZIP_City_Anaheim | ZIP_City_Antioch | ZIP_City_Aptos | ZIP_City_Arcadia | ZIP_City_Arcata | ZIP_City_Bakersfield | ZIP_City_Baldwin Park | ZIP_City_Banning | ZIP_City_Bella Vista | ZIP_City_Belmont | ZIP_City_Belvedere Tiburon | ZIP_City_Ben Lomond | ZIP_City_Berkeley | ZIP_City_Beverly Hills | ZIP_City_Bodega Bay | ZIP_City_Bonita | ZIP_City_Boulder Creek | ZIP_City_Brea | ZIP_City_Brisbane | ZIP_City_Burlingame | ZIP_City_Calabasas | ZIP_City_Camarillo | ZIP_City_Campbell | ZIP_City_Canoga Park | ZIP_City_Capistrano Beach | ZIP_City_Capitola | ZIP_City_Cardiff By The Sea | ZIP_City_Carlsbad | ZIP_City_Carpinteria | ZIP_City_Carson | ZIP_City_Castro Valley | ZIP_City_Ceres | ZIP_City_Chatsworth | ZIP_City_Chico | ZIP_City_Chino | ZIP_City_Chino Hills | ZIP_City_Chula Vista | ZIP_City_Citrus Heights | ZIP_City_Claremont | ZIP_City_Clearlake | ZIP_City_Clovis | ZIP_City_Concord | ZIP_City_Costa Mesa | ZIP_City_Crestline | ZIP_City_Culver City | ZIP_City_Cupertino | ZIP_City_Cypress | ZIP_City_Daly City | ZIP_City_Danville | ZIP_City_Davis | ZIP_City_Diamond Bar | ZIP_City_Edwards | ZIP_City_El Dorado Hills | ZIP_City_El Segundo | ZIP_City_El Sobrante | ZIP_City_Elk Grove | ZIP_City_Emeryville | ZIP_City_Encinitas | ZIP_City_Escondido | ZIP_City_Eureka | ZIP_City_Fairfield | ZIP_City_Fallbrook | ZIP_City_Fawnskin | ZIP_City_Folsom | ZIP_City_Fremont | ZIP_City_Fresno | ZIP_City_Fullerton | ZIP_City_Garden Grove | ZIP_City_Gilroy | ZIP_City_Glendale | ZIP_City_Glendora | ZIP_City_Goleta | ZIP_City_Greenbrae | ZIP_City_Hacienda Heights | ZIP_City_Half Moon Bay | ZIP_City_Hawthorne | ZIP_City_Hayward | ZIP_City_Hermosa Beach | ZIP_City_Highland | ZIP_City_Hollister | ZIP_City_Hopland | ZIP_City_Huntington Beach | 
ZIP_City_Imperial | ZIP_City_Inglewood | ZIP_City_Irvine | ZIP_City_La Jolla | ZIP_City_La Mesa | ZIP_City_La Mirada | ZIP_City_La Palma | ZIP_City_Ladera Ranch | ZIP_City_Laguna Hills | ZIP_City_Laguna Niguel | ZIP_City_Lake Forest | ZIP_City_Larkspur | ZIP_City_Livermore | ZIP_City_Loma Linda | ZIP_City_Lomita | ZIP_City_Lompoc | ZIP_City_Long Beach | ZIP_City_Los Alamitos | ZIP_City_Los Altos | ZIP_City_Los Angeles | ZIP_City_Los Gatos | ZIP_City_Manhattan Beach | ZIP_City_March Air Reserve Base | ZIP_City_Marina | ZIP_City_Martinez | ZIP_City_Menlo Park | ZIP_City_Merced | ZIP_City_Milpitas | ZIP_City_Mission Hills | ZIP_City_Mission Viejo | ZIP_City_Modesto | ZIP_City_Monrovia | ZIP_City_Montague | ZIP_City_Montclair | ZIP_City_Montebello | ZIP_City_Monterey | ZIP_City_Monterey Park | ZIP_City_Moraga | ZIP_City_Morgan Hill | ZIP_City_Moss Landing | ZIP_City_Mountain View | ZIP_City_Napa | ZIP_City_National City | ZIP_City_Newbury Park | ZIP_City_Newport Beach | ZIP_City_North Hills | ZIP_City_North Hollywood | ZIP_City_Northridge | ZIP_City_Norwalk | ZIP_City_Novato | ZIP_City_Oak View | ZIP_City_Oakland | ZIP_City_Oceanside | ZIP_City_Ojai | ZIP_City_Orange | ZIP_City_Oxnard | ZIP_City_Pacific Grove | ZIP_City_Pacific Palisades | ZIP_City_Palo Alto | ZIP_City_Palos Verdes Peninsula | ZIP_City_Pasadena | ZIP_City_Placentia | ZIP_City_Pleasant Hill | ZIP_City_Pleasanton | ZIP_City_Pomona | ZIP_City_Porter Ranch | ZIP_City_Portola Valley | ZIP_City_Poway | ZIP_City_Rancho Cordova | ZIP_City_Rancho Cucamonga | ZIP_City_Rancho Palos Verdes | ZIP_City_Redding | ZIP_City_Redlands | ZIP_City_Redondo Beach | ZIP_City_Redwood City | ZIP_City_Reseda | ZIP_City_Richmond | ZIP_City_Ridgecrest | ZIP_City_Rio Vista | ZIP_City_Riverside | ZIP_City_Rohnert Park | ZIP_City_Rosemead | ZIP_City_Roseville | ZIP_City_Sacramento | ZIP_City_Salinas | ZIP_City_San Anselmo | ZIP_City_San Bernardino | ZIP_City_San Bruno | ZIP_City_San Clemente | ZIP_City_San Diego | ZIP_City_San Dimas 
| ZIP_City_San Francisco | ZIP_City_San Gabriel | ZIP_City_San Jose | ZIP_City_San Juan Bautista | ZIP_City_San Juan Capistrano | ZIP_City_San Leandro | ZIP_City_San Luis Obispo | ZIP_City_San Luis Rey | ZIP_City_San Marcos | ZIP_City_San Mateo | ZIP_City_San Pablo | ZIP_City_San Rafael | ZIP_City_San Ramon | ZIP_City_San Ysidro | ZIP_City_Sanger | ZIP_City_Santa Ana | ZIP_City_Santa Barbara | ZIP_City_Santa Clara | ZIP_City_Santa Clarita | ZIP_City_Santa Cruz | ZIP_City_Santa Monica | ZIP_City_Santa Rosa | ZIP_City_Santa Ynez | ZIP_City_Saratoga | ZIP_City_Sausalito | ZIP_City_Seal Beach | ZIP_City_Seaside | ZIP_City_Sherman Oaks | ZIP_City_Sierra Madre | ZIP_City_Signal Hill | ZIP_City_Simi Valley | ZIP_City_Sonora | ZIP_City_South Gate | ZIP_City_South Lake Tahoe | ZIP_City_South Pasadena | ZIP_City_South San Francisco | ZIP_City_Stanford | ZIP_City_Stinson Beach | ZIP_City_Stockton | ZIP_City_Studio City | ZIP_City_Sunland | ZIP_City_Sunnyvale | ZIP_City_Sylmar | ZIP_City_Tahoe City | ZIP_City_Tehachapi | ZIP_City_Thousand Oaks | ZIP_City_Torrance | ZIP_City_Trinity Center | ZIP_City_Tustin | ZIP_City_Ukiah | ZIP_City_Upland | ZIP_City_Valencia | ZIP_City_Vallejo | ZIP_City_Van Nuys | ZIP_City_Venice | ZIP_City_Ventura | ZIP_City_Vista | ZIP_City_Walnut Creek | ZIP_City_Weed | ZIP_City_West Covina | ZIP_City_West Sacramento | ZIP_City_Westlake Village | ZIP_City_Whittier | ZIP_City_Woodland Hills | ZIP_City_Yorba Linda | ZIP_City_Yucaipa | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | True | 3.891820 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | True | 3.526361 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2.397895 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | False | 4.605170 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | True | 3.806662 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
- Comment on which metric is right for model performance evaluation, and why.
- Can model performance be improved? If yes, improve it using appropriate techniques for logistic regression and comment on model performance after the improvement.
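As a rough aid for the metric question: if missing a likely borrower costs the bank more than contacting an uninterested customer, recall on the positive class is a natural focus, with precision and F1 as cross-checks. The sketch below computes these metrics from confusion-matrix counts. The helper name `classification_metrics` and the counts are made up for illustration and are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of actual positives that the model finds
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts accuracy looks high while recall is noticeably lower, which is the kind of gap that makes accuracy alone a weak guide on this problem.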
X = data.drop(["BuyPersonal_Loan"], axis=1)
Y = data["BuyPersonal_Loan"]
# Create dummy variables (one-hot encoding) for the categorical columns
# X = pd.get_dummies(X, drop_first=True)  # drop_first reduces multicollinearity
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
X_train.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Income_log | ZIP_City_Alameda | ZIP_City_Alamo | ZIP_City_Albany | ZIP_City_Alhambra | ZIP_City_Anaheim | ZIP_City_Antioch | ZIP_City_Aptos | ZIP_City_Arcadia | ZIP_City_Arcata | ZIP_City_Bakersfield | ZIP_City_Baldwin Park | ZIP_City_Banning | ZIP_City_Bella Vista | ZIP_City_Belmont | ZIP_City_Belvedere Tiburon | ZIP_City_Ben Lomond | ZIP_City_Berkeley | ZIP_City_Beverly Hills | ZIP_City_Bodega Bay | ZIP_City_Bonita | ZIP_City_Boulder Creek | ZIP_City_Brea | ZIP_City_Brisbane | ZIP_City_Burlingame | ZIP_City_Calabasas | ZIP_City_Camarillo | ZIP_City_Campbell | ZIP_City_Canoga Park | ZIP_City_Capistrano Beach | ZIP_City_Capitola | ZIP_City_Cardiff By The Sea | ZIP_City_Carlsbad | ZIP_City_Carpinteria | ZIP_City_Carson | ZIP_City_Castro Valley | ZIP_City_Ceres | ZIP_City_Chatsworth | ZIP_City_Chico | ZIP_City_Chino | ZIP_City_Chino Hills | ZIP_City_Chula Vista | ZIP_City_Citrus Heights | ZIP_City_Claremont | ZIP_City_Clearlake | ZIP_City_Clovis | ZIP_City_Concord | ZIP_City_Costa Mesa | ZIP_City_Crestline | ZIP_City_Culver City | ZIP_City_Cupertino | ZIP_City_Cypress | ZIP_City_Daly City | ZIP_City_Danville | ZIP_City_Davis | ZIP_City_Diamond Bar | ZIP_City_Edwards | ZIP_City_El Dorado Hills | ZIP_City_El Segundo | ZIP_City_El Sobrante | ZIP_City_Elk Grove | ZIP_City_Emeryville | ZIP_City_Encinitas | ZIP_City_Escondido | ZIP_City_Eureka | ZIP_City_Fairfield | ZIP_City_Fallbrook | ZIP_City_Fawnskin | ZIP_City_Folsom | ZIP_City_Fremont | ZIP_City_Fresno | ZIP_City_Fullerton | ZIP_City_Garden Grove | ZIP_City_Gilroy | ZIP_City_Glendale | ZIP_City_Glendora | ZIP_City_Goleta | ZIP_City_Greenbrae | ZIP_City_Hacienda Heights | ZIP_City_Half Moon Bay | ZIP_City_Hawthorne | ZIP_City_Hayward | ZIP_City_Hermosa Beach | ZIP_City_Highland | ZIP_City_Hollister | ZIP_City_Hopland | ZIP_City_Huntington Beach | ZIP_City_Imperial | 
ZIP_City_Inglewood | ZIP_City_Irvine | ZIP_City_La Jolla | ZIP_City_La Mesa | ZIP_City_La Mirada | ZIP_City_La Palma | ZIP_City_Ladera Ranch | ZIP_City_Laguna Hills | ZIP_City_Laguna Niguel | ZIP_City_Lake Forest | ZIP_City_Larkspur | ZIP_City_Livermore | ZIP_City_Loma Linda | ZIP_City_Lomita | ZIP_City_Lompoc | ZIP_City_Long Beach | ZIP_City_Los Alamitos | ZIP_City_Los Altos | ZIP_City_Los Angeles | ZIP_City_Los Gatos | ZIP_City_Manhattan Beach | ZIP_City_March Air Reserve Base | ZIP_City_Marina | ZIP_City_Martinez | ZIP_City_Menlo Park | ZIP_City_Merced | ZIP_City_Milpitas | ZIP_City_Mission Hills | ZIP_City_Mission Viejo | ZIP_City_Modesto | ZIP_City_Monrovia | ZIP_City_Montague | ZIP_City_Montclair | ZIP_City_Montebello | ZIP_City_Monterey | ZIP_City_Monterey Park | ZIP_City_Moraga | ZIP_City_Morgan Hill | ZIP_City_Moss Landing | ZIP_City_Mountain View | ZIP_City_Napa | ZIP_City_National City | ZIP_City_Newbury Park | ZIP_City_Newport Beach | ZIP_City_North Hills | ZIP_City_North Hollywood | ZIP_City_Northridge | ZIP_City_Norwalk | ZIP_City_Novato | ZIP_City_Oak View | ZIP_City_Oakland | ZIP_City_Oceanside | ZIP_City_Ojai | ZIP_City_Orange | ZIP_City_Oxnard | ZIP_City_Pacific Grove | ZIP_City_Pacific Palisades | ZIP_City_Palo Alto | ZIP_City_Palos Verdes Peninsula | ZIP_City_Pasadena | ZIP_City_Placentia | ZIP_City_Pleasant Hill | ZIP_City_Pleasanton | ZIP_City_Pomona | ZIP_City_Porter Ranch | ZIP_City_Portola Valley | ZIP_City_Poway | ZIP_City_Rancho Cordova | ZIP_City_Rancho Cucamonga | ZIP_City_Rancho Palos Verdes | ZIP_City_Redding | ZIP_City_Redlands | ZIP_City_Redondo Beach | ZIP_City_Redwood City | ZIP_City_Reseda | ZIP_City_Richmond | ZIP_City_Ridgecrest | ZIP_City_Rio Vista | ZIP_City_Riverside | ZIP_City_Rohnert Park | ZIP_City_Rosemead | ZIP_City_Roseville | ZIP_City_Sacramento | ZIP_City_Salinas | ZIP_City_San Anselmo | ZIP_City_San Bernardino | ZIP_City_San Bruno | ZIP_City_San Clemente | ZIP_City_San Diego | ZIP_City_San Dimas | ZIP_City_San 
Francisco | ZIP_City_San Gabriel | ZIP_City_San Jose | ZIP_City_San Juan Bautista | ZIP_City_San Juan Capistrano | ZIP_City_San Leandro | ZIP_City_San Luis Obispo | ZIP_City_San Luis Rey | ZIP_City_San Marcos | ZIP_City_San Mateo | ZIP_City_San Pablo | ZIP_City_San Rafael | ZIP_City_San Ramon | ZIP_City_San Ysidro | ZIP_City_Sanger | ZIP_City_Santa Ana | ZIP_City_Santa Barbara | ZIP_City_Santa Clara | ZIP_City_Santa Clarita | ZIP_City_Santa Cruz | ZIP_City_Santa Monica | ZIP_City_Santa Rosa | ZIP_City_Santa Ynez | ZIP_City_Saratoga | ZIP_City_Sausalito | ZIP_City_Seal Beach | ZIP_City_Seaside | ZIP_City_Sherman Oaks | ZIP_City_Sierra Madre | ZIP_City_Signal Hill | ZIP_City_Simi Valley | ZIP_City_Sonora | ZIP_City_South Gate | ZIP_City_South Lake Tahoe | ZIP_City_South Pasadena | ZIP_City_South San Francisco | ZIP_City_Stanford | ZIP_City_Stinson Beach | ZIP_City_Stockton | ZIP_City_Studio City | ZIP_City_Sunland | ZIP_City_Sunnyvale | ZIP_City_Sylmar | ZIP_City_Tahoe City | ZIP_City_Tehachapi | ZIP_City_Thousand Oaks | ZIP_City_Torrance | ZIP_City_Trinity Center | ZIP_City_Tustin | ZIP_City_Ukiah | ZIP_City_Upland | ZIP_City_Valencia | ZIP_City_Vallejo | ZIP_City_Van Nuys | ZIP_City_Venice | ZIP_City_Ventura | ZIP_City_Vista | ZIP_City_Walnut Creek | ZIP_City_Weed | ZIP_City_West Covina | ZIP_City_West Sacramento | ZIP_City_Westlake Village | ZIP_City_Whittier | ZIP_City_Woodland Hills | ZIP_City_Yorba Linda | ZIP_City_Yucaipa | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 902 | 53 | 28 | 184 | 1 | 8.1 | 1 | 303 | 0 | 0 | 0 | 1 | 0 | 5.214936 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3724 | 55 | 30 | 82 | 4 | 1.3 | 3 | 219 | 0 | 0 | 0 | 0 | 1 | 4.406719 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 701 | 41 | 16 | 10 | 2 | 0.3 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 2.302585 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3113 | 33 | 7 | 31 | 4 | 1.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 3.433987 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4057 | 50 | 26 | 11 | 4 | 0.2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2.397895 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3476, 256)
Shape of test set :  (1490, 256)
Percentage of classes in training set:
False    0.569908
True     0.430092
Name: BuyPersonal_Loan, dtype: float64
Percentage of classes in test set:
False    0.577181
True     0.422819
Name: BuyPersonal_Loan, dtype: float64
print("{0:0.2f}% data is in training set".format((len(X_train)/len(data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(data.index)) * 100))
70.00% data is in training set
30.00% data is in test set
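The class shares printed above differ slightly between the two sets (about 43.0% positive in training versus 42.3% in test). If a closer match is wanted, `train_test_split` accepts a `stratify` argument. Here is a minimal sketch on synthetic stand-in data; the arrays `X_demo` and `y_demo` are hypothetical and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 43% positive class
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 430 + [False] * 570)

# stratify=y_demo asks scikit-learn to keep the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share :", y_te.mean())
```

Whether stratification matters here is a judgment call; with roughly 5,000 rows and a 43% positive share, an unstratified split is already fairly close.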
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.85):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicting class-1 probabilities using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    pred = np.round(pred_thres)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
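The function above labels an observation as class 1 only when its predicted probability exceeds `threshold` (0.85 by default). To illustrate how that choice trades recall against precision, here is a small standard-library sketch; the probabilities, labels, and helper names `apply_threshold` and `recall_precision` are made up for the example.

```python
def apply_threshold(probabilities, threshold):
    """Label an observation as class 1 when its probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


def recall_precision(y_true, y_pred):
    """Compute recall and precision for the positive class from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical predicted probabilities and true labels
probs = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]

results = {}
for threshold in (0.5, 0.85):
    y_pred = apply_threshold(probs, threshold)
    results[threshold] = recall_precision(y_true, y_pred)
    recall, precision = results[threshold]
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

In this toy data, raising the threshold from 0.5 to 0.85 lowers recall and raises precision, so a high default such as 0.85 favors precision.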
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.85):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    y_pred = np.round(pred_thres)

    cm = confusion_matrix(target, y_pred)
    # each cell label shows the count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
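The annotation step in the plotting function can be looked at in isolation: each cell of the 2x2 matrix gets a label with its count and its share of all observations. A minimal sketch with a made-up matrix (the values in `cm` are hypothetical):

```python
# Hypothetical 2x2 confusion matrix: rows are true labels, columns are predictions
cm = [[50, 10],
      [5, 35]]

total = sum(sum(row) for row in cm)

# Each cell label holds the raw count and, on a second line, its percentage share
labels = [[f"{count}\n{count / total:.2%}" for count in row] for row in cm]

for row in labels:
    # Show the two-line labels on one line for console output
    print([cell.replace("\n", " / ") for cell in row])
```

Passing `fmt=""` to `sns.heatmap`, as the function above does, lets these ready-made strings be used as annotations without further number formatting.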
# scikit-learn's LogisticRegression offers several solvers
# newton-cg is used here; it supports the default L2 penalty
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train, y_train)
## Test set score (mean accuracy)
model_score = model.score(X_test, y_test)
print(model_score)
1.0
## Training set score (mean accuracy)
model_score = model.score(X_train, y_train)
print(model_score)
1.0
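A score of exactly 1.0 on both the training and test sets is unusual for customer data and is often worth a sanity check for target leakage, that is, predictors that already encode the target. Here `Personal_Loan` remains among the predictors while the target is `BuyPersonal_Loan`. The sketch below uses made-up rows and a hypothetical helper, `columns_implying_target`, to flag binary columns whose value 1 always coincides with a positive target. It is a rough screening idea under those assumptions, not a definitive test.

```python
def columns_implying_target(rows, target, columns):
    """Return binary columns where every row with value 1 has a positive target."""
    flagged = []
    for col in columns:
        hits = [row for row in rows if row[col] == 1]
        # Flag the column only if it occurs and never appears with a negative target
        if hits and all(row[target] for row in hits):
            flagged.append(col)
    return flagged


# Made-up rows for illustration (not taken from the dataset)
rows = [
    {"Online": 1, "Personal_Loan": 0, "CreditCard": 0, "BuyPersonal_Loan": False},
    {"Online": 0, "Personal_Loan": 1, "CreditCard": 0, "BuyPersonal_Loan": True},
    {"Online": 1, "Personal_Loan": 0, "CreditCard": 1, "BuyPersonal_Loan": True},
    {"Online": 0, "Personal_Loan": 0, "CreditCard": 0, "BuyPersonal_Loan": False},
]

suspects = columns_implying_target(
    rows, "BuyPersonal_Loan", ["Online", "Personal_Loan", "CreditCard"]
)
print(suspects)
```

On a handful of rows a flag can be coincidental, so a check like this is only meaningful when run over the full data, and any flagged column still needs a look at how the target was constructed.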
# let us check the coefficients and intercept of the model
coef_df = pd.DataFrame(
    np.append(lg.coef_, lg.intercept_),
    index=X_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df.T
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Income_log | ZIP_City_Alameda | ZIP_City_Alamo | ZIP_City_Albany | ZIP_City_Alhambra | ZIP_City_Anaheim | ZIP_City_Antioch | ZIP_City_Aptos | ZIP_City_Arcadia | ZIP_City_Arcata | ZIP_City_Bakersfield | ZIP_City_Baldwin Park | ZIP_City_Banning | ZIP_City_Bella Vista | ZIP_City_Belmont | ZIP_City_Belvedere Tiburon | ZIP_City_Ben Lomond | ZIP_City_Berkeley | ZIP_City_Beverly Hills | ZIP_City_Bodega Bay | ZIP_City_Bonita | ZIP_City_Boulder Creek | ZIP_City_Brea | ZIP_City_Brisbane | ZIP_City_Burlingame | ZIP_City_Calabasas | ZIP_City_Camarillo | ZIP_City_Campbell | ZIP_City_Canoga Park | ZIP_City_Capistrano Beach | ZIP_City_Capitola | ZIP_City_Cardiff By The Sea | ZIP_City_Carlsbad | ZIP_City_Carpinteria | ZIP_City_Carson | ZIP_City_Castro Valley | ZIP_City_Ceres | ZIP_City_Chatsworth | ZIP_City_Chico | ZIP_City_Chino | ZIP_City_Chino Hills | ZIP_City_Chula Vista | ZIP_City_Citrus Heights | ZIP_City_Claremont | ZIP_City_Clearlake | ZIP_City_Clovis | ZIP_City_Concord | ZIP_City_Costa Mesa | ZIP_City_Crestline | ZIP_City_Culver City | ZIP_City_Cupertino | ZIP_City_Cypress | ZIP_City_Daly City | ZIP_City_Danville | ZIP_City_Davis | ZIP_City_Diamond Bar | ZIP_City_Edwards | ZIP_City_El Dorado Hills | ZIP_City_El Segundo | ZIP_City_El Sobrante | ZIP_City_Elk Grove | ZIP_City_Emeryville | ZIP_City_Encinitas | ZIP_City_Escondido | ZIP_City_Eureka | ZIP_City_Fairfield | ZIP_City_Fallbrook | ZIP_City_Fawnskin | ZIP_City_Folsom | ZIP_City_Fremont | ZIP_City_Fresno | ZIP_City_Fullerton | ZIP_City_Garden Grove | ZIP_City_Gilroy | ZIP_City_Glendale | ZIP_City_Glendora | ZIP_City_Goleta | ZIP_City_Greenbrae | ZIP_City_Hacienda Heights | ZIP_City_Half Moon Bay | ZIP_City_Hawthorne | ZIP_City_Hayward | ZIP_City_Hermosa Beach | ZIP_City_Highland | ZIP_City_Hollister | ZIP_City_Hopland | ZIP_City_Huntington Beach | ZIP_City_Imperial | 
ZIP_City_Inglewood | ZIP_City_Irvine | ZIP_City_La Jolla | ZIP_City_La Mesa | ZIP_City_La Mirada | ZIP_City_La Palma | ZIP_City_Ladera Ranch | ZIP_City_Laguna Hills | ZIP_City_Laguna Niguel | ZIP_City_Lake Forest | ZIP_City_Larkspur | ZIP_City_Livermore | ZIP_City_Loma Linda | ZIP_City_Lomita | ZIP_City_Lompoc | ZIP_City_Long Beach | ZIP_City_Los Alamitos | ZIP_City_Los Altos | ZIP_City_Los Angeles | ZIP_City_Los Gatos | ZIP_City_Manhattan Beach | ZIP_City_March Air Reserve Base | ZIP_City_Marina | ZIP_City_Martinez | ZIP_City_Menlo Park | ZIP_City_Merced | ZIP_City_Milpitas | ZIP_City_Mission Hills | ZIP_City_Mission Viejo | ZIP_City_Modesto | ZIP_City_Monrovia | ZIP_City_Montague | ZIP_City_Montclair | ZIP_City_Montebello | ZIP_City_Monterey | ZIP_City_Monterey Park | ZIP_City_Moraga | ZIP_City_Morgan Hill | ZIP_City_Moss Landing | ZIP_City_Mountain View | ZIP_City_Napa | ZIP_City_National City | ZIP_City_Newbury Park | ZIP_City_Newport Beach | ZIP_City_North Hills | ZIP_City_North Hollywood | ZIP_City_Northridge | ZIP_City_Norwalk | ZIP_City_Novato | ZIP_City_Oak View | ZIP_City_Oakland | ZIP_City_Oceanside | ZIP_City_Ojai | ZIP_City_Orange | ZIP_City_Oxnard | ZIP_City_Pacific Grove | ZIP_City_Pacific Palisades | ZIP_City_Palo Alto | ZIP_City_Palos Verdes Peninsula | ZIP_City_Pasadena | ZIP_City_Placentia | ZIP_City_Pleasant Hill | ZIP_City_Pleasanton | ZIP_City_Pomona | ZIP_City_Porter Ranch | ZIP_City_Portola Valley | ZIP_City_Poway | ZIP_City_Rancho Cordova | ZIP_City_Rancho Cucamonga | ZIP_City_Rancho Palos Verdes | ZIP_City_Redding | ZIP_City_Redlands | ZIP_City_Redondo Beach | ZIP_City_Redwood City | ZIP_City_Reseda | ZIP_City_Richmond | ZIP_City_Ridgecrest | ZIP_City_Rio Vista | ZIP_City_Riverside | ZIP_City_Rohnert Park | ZIP_City_Rosemead | ZIP_City_Roseville | ZIP_City_Sacramento | ZIP_City_Salinas | ZIP_City_San Anselmo | ZIP_City_San Bernardino | ZIP_City_San Bruno | ZIP_City_San Clemente | ZIP_City_San Diego | ZIP_City_San Dimas | ZIP_City_San 
Francisco | ZIP_City_San Gabriel | ZIP_City_San Jose | ZIP_City_San Juan Bautista | ZIP_City_San Juan Capistrano | ZIP_City_San Leandro | ZIP_City_San Luis Obispo | ZIP_City_San Luis Rey | ZIP_City_San Marcos | ZIP_City_San Mateo | ZIP_City_San Pablo | ZIP_City_San Rafael | ZIP_City_San Ramon | ZIP_City_San Ysidro | ZIP_City_Sanger | ZIP_City_Santa Ana | ZIP_City_Santa Barbara | ZIP_City_Santa Clara | ZIP_City_Santa Clarita | ZIP_City_Santa Cruz | ZIP_City_Santa Monica | ZIP_City_Santa Rosa | ZIP_City_Santa Ynez | ZIP_City_Saratoga | ZIP_City_Sausalito | ZIP_City_Seal Beach | ZIP_City_Seaside | ZIP_City_Sherman Oaks | ZIP_City_Sierra Madre | ZIP_City_Signal Hill | ZIP_City_Simi Valley | ZIP_City_Sonora | ZIP_City_South Gate | ZIP_City_South Lake Tahoe | ZIP_City_South Pasadena | ZIP_City_South San Francisco | ZIP_City_Stanford | ZIP_City_Stinson Beach | ZIP_City_Stockton | ZIP_City_Studio City | ZIP_City_Sunland | ZIP_City_Sunnyvale | ZIP_City_Sylmar | ZIP_City_Tahoe City | ZIP_City_Tehachapi | ZIP_City_Thousand Oaks | ZIP_City_Torrance | ZIP_City_Trinity Center | ZIP_City_Tustin | ZIP_City_Ukiah | ZIP_City_Upland | ZIP_City_Valencia | ZIP_City_Vallejo | ZIP_City_Van Nuys | ZIP_City_Venice | ZIP_City_Ventura | ZIP_City_Vista | ZIP_City_Walnut Creek | ZIP_City_Weed | ZIP_City_West Covina | ZIP_City_West Sacramento | ZIP_City_Westlake Village | ZIP_City_Whittier | ZIP_City_Woodland Hills | ZIP_City_Yorba Linda | ZIP_City_Yucaipa | Intercept | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -0.000768 | 0.002001 | 0.015 | 0.118008 | 0.01952 | 0.224357 | 0.000042 | 6.90691 | 7.850852 | 0.664176 | -0.00099 | 9.030536 | -0.429207 | 0.000073 | -0.011696 | 0.001097 | -0.071555 | -0.030922 | 0.0 | -0.032771 | 0.014207 | -0.08427 | -0.024509 | -0.012104 | -0.031926 | 0.000001 | -0.006034 | -0.008066 | -0.038008 | -0.028177 | 0.025319 | 0.025489 | -0.015125 | -0.060502 | -0.012793 | -0.026921 | -0.04318 | 0.05545 | 0.010153 | 0.00257 | 0.040789 | -0.034199 | 0.000838 | -0.01486 | -0.063739 | 0.034616 | -0.005485 | 0.020631 | 0.0 | -0.003143 | -0.060108 | -0.006763 | 0.060129 | 0.105039 | -0.031437 | 0.001281 | 0.00081 | 0.029797 | 0.0506 | -0.033654 | -0.024778 | -0.022219 | -0.019884 | -0.057356 | 0.022206 | 0.0 | -0.081175 | 0.064942 | -0.009788 | -0.008065 | -0.122416 | 0.044354 | 0.009835 | 0.031936 | -0.022545 | 0.013931 | 0.029919 | 0.042701 | -0.007168 | 0.054472 | 0.001099 | -0.095573 | -0.042276 | -0.047874 | 0.020451 | 0.06641 | 0.032368 | 0.060922 | 0.029784 | 0.003773 | -0.029337 | -0.018439 | 0.048468 | -0.037918 | -0.068178 | -0.04265 | 0.025797 | -0.018966 | -0.03146 | 0.013854 | -0.005731 | 0.140032 | 0.005918 | -0.030542 | -0.008951 | -0.032471 | 0.0 | 0.025026 | -0.0273 | 0.004443 | 0.007254 | 0.011188 | -0.07465 | 0.034915 | 0.086863 | -0.067102 | 0.053544 | 0.078422 | -0.000301 | 0.057933 | -0.037816 | 0.016843 | 0.007121 | 0.066593 | -0.059463 | -0.061014 | -0.022927 | -0.018033 | 0.007895 | 0.045401 | 0.043595 | 0.007502 | -0.042068 | -0.022324 | 0.01816 | -0.015115 | 0.014951 | 0.002698 | 0.133734 | 0.007178 | 0.0 | -0.026299 | -0.073271 | -0.021457 | -0.018598 | -0.042746 | -0.076334 | 0.006044 | 0.035897 | 0.032618 | -0.060919 | 0.029342 | 0.048404 | 0.019754 | -0.00186 | 0.050121 | -0.008893 | 0.087929 | 0.02529 | 0.142796 | -0.016498 | -0.029534 | 0.075627 | -0.079852 | 0.0 | -0.023099 | 0.035083 | -0.058297 | -0.04348 | -0.019061 | 0.009917 | 0.050804 | -0.054576 | -0.100376 | -0.016251 | 0.084072 | -0.057607 | 
-0.009182 | -0.02234 | 0.003733 | -0.025447 | 0.021329 | 0.085163 | -0.003822 | 0.073172 | 0.012003 | -0.018155 | 0.054288 | -0.003853 | -0.019522 | 0.000799 | 0.02706 | 0.121288 | -0.005495 | -0.007843 | -0.009523 | 0.069522 | 0.024341 | -0.068317 | -0.057061 | 0.017858 | 0.006193 | -0.000982 | 0.083199 | -0.034412 | -0.005678 | -0.031055 | -0.003123 | 0.034572 | 0.072651 | 0.058311 | -0.002338 | 0.00345 | 0.038931 | 0.037862 | -0.078509 | 0.077947 | -0.026772 | 0.0 | 0.035737 | 0.003893 | 0.020298 | -0.003505 | -0.002335 | -0.023113 | -0.119705 | 0.041414 | 0.0 | -0.02326 | -0.034138 | -0.036229 | -0.018742 | 0.027248 | 0.0 | 0.04154 | -0.014301 | -0.061712 | -0.029553 | -0.087432 | -0.030138 | -0.007584 | 0.037859 | 0.007667 | 0.030414 | 0.047172 | -0.016037 | 0.001563 | -0.041598 | -0.020582 | 0.016092 | 0.054174 | -0.016368 | -0.001258 | -0.014581 | 0.027247 | 0.028606 | -4.490986 |
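Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature, holding the others fixed. As a worked illustration, the sketch below converts three coefficients copied from the table above using only the standard library; the dictionary names are chosen for the example.

```python
import math

# Coefficients copied from the table above (log-odds per one-unit increase)
coefficients = {"Family": 0.118008, "Education": 0.224357, "Income_log": -0.429207}

results = {}
for feature, coef in coefficients.items():
    odds_ratio = math.exp(coef)          # multiplicative change in the odds
    pct_change = (odds_ratio - 1) * 100  # the same change expressed in percent
    results[feature] = (odds_ratio, pct_change)
    print(f"{feature}: odds ratio {odds_ratio:.3f}, change {pct_change:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (the odds rise) and a negative one gives a ratio below 1 (the odds fall). The same arithmetic underlies the `Odds` and `Change_odd%` rows computed next.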
# converting coefficients to odds
odds = np.exp(lg.coef_[0])
# finding the percentage change
perc_change_odds = (np.exp(lg.coef_[0]) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns).T
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Income_log | ZIP_City_Alameda | ZIP_City_Alamo | ZIP_City_Albany | ZIP_City_Alhambra | ZIP_City_Anaheim | ZIP_City_Antioch | ZIP_City_Aptos | ZIP_City_Arcadia | ZIP_City_Arcata | ZIP_City_Bakersfield | ZIP_City_Baldwin Park | ZIP_City_Banning | ZIP_City_Bella Vista | ZIP_City_Belmont | ZIP_City_Belvedere Tiburon | ZIP_City_Ben Lomond | ZIP_City_Berkeley | ZIP_City_Beverly Hills | ZIP_City_Bodega Bay | ZIP_City_Bonita | ZIP_City_Boulder Creek | ZIP_City_Brea | ZIP_City_Brisbane | ZIP_City_Burlingame | ZIP_City_Calabasas | ZIP_City_Camarillo | ZIP_City_Campbell | ZIP_City_Canoga Park | ZIP_City_Capistrano Beach | ZIP_City_Capitola | ZIP_City_Cardiff By The Sea | ZIP_City_Carlsbad | ZIP_City_Carpinteria | ZIP_City_Carson | ZIP_City_Castro Valley | ZIP_City_Ceres | ZIP_City_Chatsworth | ZIP_City_Chico | ZIP_City_Chino | ZIP_City_Chino Hills | ZIP_City_Chula Vista | ZIP_City_Citrus Heights | ZIP_City_Claremont | ZIP_City_Clearlake | ZIP_City_Clovis | ZIP_City_Concord | ZIP_City_Costa Mesa | ZIP_City_Crestline | ZIP_City_Culver City | ZIP_City_Cupertino | ZIP_City_Cypress | ZIP_City_Daly City | ZIP_City_Danville | ZIP_City_Davis | ZIP_City_Diamond Bar | ZIP_City_Edwards | ZIP_City_El Dorado Hills | ZIP_City_El Segundo | ZIP_City_El Sobrante | ZIP_City_Elk Grove | ZIP_City_Emeryville | ZIP_City_Encinitas | ZIP_City_Escondido | ZIP_City_Eureka | ZIP_City_Fairfield | ZIP_City_Fallbrook | ZIP_City_Fawnskin | ZIP_City_Folsom | ZIP_City_Fremont | ZIP_City_Fresno | ZIP_City_Fullerton | ZIP_City_Garden Grove | ZIP_City_Gilroy | ZIP_City_Glendale | ZIP_City_Glendora | ZIP_City_Goleta | ZIP_City_Greenbrae | ZIP_City_Hacienda Heights | ZIP_City_Half Moon Bay | ZIP_City_Hawthorne | ZIP_City_Hayward | ZIP_City_Hermosa Beach | ZIP_City_Highland | ZIP_City_Hollister | ZIP_City_Hopland | ZIP_City_Huntington Beach | ZIP_City_Imperial | 
ZIP_City_Inglewood | ZIP_City_Irvine | ZIP_City_La Jolla | ZIP_City_La Mesa | ZIP_City_La Mirada | ZIP_City_La Palma | ZIP_City_Ladera Ranch | ZIP_City_Laguna Hills | ZIP_City_Laguna Niguel | ZIP_City_Lake Forest | ZIP_City_Larkspur | ZIP_City_Livermore | ZIP_City_Loma Linda | ZIP_City_Lomita | ZIP_City_Lompoc | ZIP_City_Long Beach | ZIP_City_Los Alamitos | ZIP_City_Los Altos | ZIP_City_Los Angeles | ZIP_City_Los Gatos | ZIP_City_Manhattan Beach | ZIP_City_March Air Reserve Base | ZIP_City_Marina | ZIP_City_Martinez | ZIP_City_Menlo Park | ZIP_City_Merced | ZIP_City_Milpitas | ZIP_City_Mission Hills | ZIP_City_Mission Viejo | ZIP_City_Modesto | ZIP_City_Monrovia | ZIP_City_Montague | ZIP_City_Montclair | ZIP_City_Montebello | ZIP_City_Monterey | ZIP_City_Monterey Park | ZIP_City_Moraga | ZIP_City_Morgan Hill | ZIP_City_Moss Landing | ZIP_City_Mountain View | ZIP_City_Napa | ZIP_City_National City | ZIP_City_Newbury Park | ZIP_City_Newport Beach | ZIP_City_North Hills | ZIP_City_North Hollywood | ZIP_City_Northridge | ZIP_City_Norwalk | ZIP_City_Novato | ZIP_City_Oak View | ZIP_City_Oakland | ZIP_City_Oceanside | ZIP_City_Ojai | ZIP_City_Orange | ZIP_City_Oxnard | ZIP_City_Pacific Grove | ZIP_City_Pacific Palisades | ZIP_City_Palo Alto | ZIP_City_Palos Verdes Peninsula | ZIP_City_Pasadena | ZIP_City_Placentia | ZIP_City_Pleasant Hill | ZIP_City_Pleasanton | ZIP_City_Pomona | ZIP_City_Porter Ranch | ZIP_City_Portola Valley | ZIP_City_Poway | ZIP_City_Rancho Cordova | ZIP_City_Rancho Cucamonga | ZIP_City_Rancho Palos Verdes | ZIP_City_Redding | ZIP_City_Redlands | ZIP_City_Redondo Beach | ZIP_City_Redwood City | ZIP_City_Reseda | ZIP_City_Richmond | ZIP_City_Ridgecrest | ZIP_City_Rio Vista | ZIP_City_Riverside | ZIP_City_Rohnert Park | ZIP_City_Rosemead | ZIP_City_Roseville | ZIP_City_Sacramento | ZIP_City_Salinas | ZIP_City_San Anselmo | ZIP_City_San Bernardino | ZIP_City_San Bruno | ZIP_City_San Clemente | ZIP_City_San Diego | ZIP_City_San Dimas | ZIP_City_San 
Francisco | ZIP_City_San Gabriel | ZIP_City_San Jose | ZIP_City_San Juan Bautista | ZIP_City_San Juan Capistrano | ZIP_City_San Leandro | ZIP_City_San Luis Obispo | ZIP_City_San Luis Rey | ZIP_City_San Marcos | ZIP_City_San Mateo | ZIP_City_San Pablo | ZIP_City_San Rafael | ZIP_City_San Ramon | ZIP_City_San Ysidro | ZIP_City_Sanger | ZIP_City_Santa Ana | ZIP_City_Santa Barbara | ZIP_City_Santa Clara | ZIP_City_Santa Clarita | ZIP_City_Santa Cruz | ZIP_City_Santa Monica | ZIP_City_Santa Rosa | ZIP_City_Santa Ynez | ZIP_City_Saratoga | ZIP_City_Sausalito | ZIP_City_Seal Beach | ZIP_City_Seaside | ZIP_City_Sherman Oaks | ZIP_City_Sierra Madre | ZIP_City_Signal Hill | ZIP_City_Simi Valley | ZIP_City_Sonora | ZIP_City_South Gate | ZIP_City_South Lake Tahoe | ZIP_City_South Pasadena | ZIP_City_South San Francisco | ZIP_City_Stanford | ZIP_City_Stinson Beach | ZIP_City_Stockton | ZIP_City_Studio City | ZIP_City_Sunland | ZIP_City_Sunnyvale | ZIP_City_Sylmar | ZIP_City_Tahoe City | ZIP_City_Tehachapi | ZIP_City_Thousand Oaks | ZIP_City_Torrance | ZIP_City_Trinity Center | ZIP_City_Tustin | ZIP_City_Ukiah | ZIP_City_Upland | ZIP_City_Valencia | ZIP_City_Vallejo | ZIP_City_Van Nuys | ZIP_City_Venice | ZIP_City_Ventura | ZIP_City_Vista | ZIP_City_Walnut Creek | ZIP_City_Weed | ZIP_City_West Covina | ZIP_City_West Sacramento | ZIP_City_Westlake Village | ZIP_City_Whittier | ZIP_City_Woodland Hills | ZIP_City_Yorba Linda | ZIP_City_Yucaipa | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.999233 | 1.002003 | 1.015113 | 1.125253 | 1.019712 | 1.251518 | 1.000042 | 999.155297 | 2567.921185 | 1.942888 | 0.999010 | 8354.334619 | 0.651025 | 1.000073 | 0.988372 | 1.001097 | 0.930945 | 0.969551 | 1.0 | 0.967760 | 1.014308 | 0.919183 | 0.975788 | 0.987969 | 0.968578 | 1.000001 | 0.993985 | 0.991967 | 0.962705 | 0.972216 | 1.025642 | 1.025817 | 0.984989 | 0.941292 | 0.987288 | 0.973438 | 0.957739 | 1.057016 | 1.010205 | 1.002574 | 1.041632 | 0.966379 | 1.000839 | 0.98525 | 0.938250 | 1.035222 | 0.994530 | 1.020846 | 1.0 | 0.996862 | 0.941663 | 0.993260 | 1.061973 | 1.110754 | 0.969052 | 1.001282 | 1.000810 | 1.030245 | 1.051902 | 0.966906 | 0.975526 | 0.978026 | 0.980312 | 0.944258 | 1.022454 | 1.0 | 0.922032 | 1.067097 | 0.990259 | 0.991967 | 0.884780 | 1.045352 | 1.009884 | 1.032452 | 0.977708 | 1.014028 | 1.030371 | 1.043626 | 0.992857 | 1.055983 | 1.00110 | 0.908852 | 0.958605 | 0.953254 | 1.020662 | 1.068665 | 1.032897 | 1.062816 | 1.030232 | 1.003780 | 0.971089 | 0.981730 | 1.049662 | 0.962792 | 0.934094 | 0.958247 | 1.026133 | 0.981213 | 0.969030 | 1.013950 | 0.994285 | 1.15031 | 1.005936 | 0.969920 | 0.991089 | 0.968051 | 1.0 | 1.025341 | 0.97307 | 1.004453 | 1.007281 | 1.011250 | 0.928068 | 1.035531 | 1.090748 | 0.935100 | 1.055003 | 1.081579 | 0.999699 | 1.059644 | 0.962890 | 1.016986 | 1.007147 | 1.068860 | 0.942271 | 0.940811 | 0.977334 | 0.982129 | 1.007926 | 1.046448 | 1.044559 | 1.007530 | 0.958804 | 0.977923 | 1.018326 | 0.984998 | 1.015064 | 1.002702 | 1.143089 | 1.007204 | 1.0 | 0.974043 | 0.929348 | 0.978771 | 0.981574 | 0.958155 | 0.926506 | 1.006062 | 1.036549 | 1.033155 | 0.940899 | 1.029777 | 1.049595 | 1.019950 | 0.998142 | 1.051398 | 0.991147 | 1.091911 | 1.025612 | 1.153495 | 0.983637 | 0.970898 | 1.078560 | 0.923253 | 1.0 | 0.977165 | 1.035706 | 0.943370 | 0.957451 | 0.981120 | 1.009967 | 1.052116 | 0.946886 | 0.904497 | 0.98388 | 1.087707 | 0.944020 | 0.990860 | 0.977907 | 1.003740 | 0.974874 | 1.021558 | 1.088895 | 
0.996185 | 1.075916 | 1.012075 | 0.982009 | 1.055788 | 0.996155 | 0.980667 | 1.000799 | 1.027429 | 1.128950 | 0.994520 | 0.992188 | 0.990522 | 1.071996 | 1.024639 | 0.933964 | 0.944536 | 1.018019 | 1.006212 | 0.999019 | 1.086758 | 0.966173 | 0.994338 | 0.969422 | 0.996882 | 1.035177 | 1.075356 | 1.060044 | 0.997665 | 1.003456 | 1.039698 | 1.038587 | 0.924493 | 1.081066 | 0.973583 | 1.0 | 1.036383 | 1.003901 | 1.020506 | 0.996501 | 0.997668 | 0.977152 | 0.887182 | 1.042284 | 1.0 | 0.977008 | 0.966438 | 0.964419 | 0.981433 | 1.027622 | 1.0 | 1.042415 | 0.985801 | 0.940154 | 0.970879 | 0.916281 | 0.970311 | 0.992445 | 1.038584 | 1.007696 | 1.030881 | 1.048303 | 0.984091 | 1.001564 | 0.959256 | 0.979628 | 1.016222 | 1.055668 | 0.983765 | 0.998743 | 0.985524 | 1.027622 | 1.029019 |
| Change_odd% | -0.076741 | 0.200318 | 1.511306 | 12.525286 | 1.971225 | 25.151801 | 0.004201 | 99815.529699 | 256692.118509 | 94.288828 | -0.098979 | 835333.461945 | -34.897514 | 0.007257 | -1.162795 | 0.109737 | -6.905510 | -3.044886 | 0.0 | -3.224004 | 1.430809 | -8.081691 | -2.421156 | -1.203092 | -3.142188 | 0.000108 | -0.601534 | -0.803335 | -3.729452 | -2.778353 | 2.564223 | 2.581684 | -1.501106 | -5.870779 | -1.271175 | -2.656204 | -4.226122 | 5.701634 | 1.020463 | 0.257352 | 4.163205 | -3.362117 | 0.083869 | -1.47497 | -6.175027 | 3.522214 | -0.547003 | 2.084574 | 0.0 | -0.313814 | -5.833711 | -0.674039 | 6.197345 | 11.075432 | -3.094789 | 0.128214 | 0.081041 | 3.024524 | 5.190170 | -3.309399 | -2.447371 | -2.197351 | -1.968766 | -5.574171 | 2.245438 | 0.0 | -7.796778 | 6.709676 | -0.974064 | -0.803286 | -11.522012 | 4.535224 | 0.988401 | 3.245188 | -2.229250 | 1.402799 | 3.037064 | 4.362564 | -0.714258 | 5.598336 | 0.10996 | -9.114808 | -4.139470 | -4.674598 | 2.066152 | 6.866497 | 3.289718 | 6.281587 | 3.023190 | 0.377977 | -2.891067 | -1.827009 | 4.966186 | -3.720824 | -6.590564 | -4.175349 | 2.613313 | -1.878706 | -3.096997 | 1.395005 | -0.571487 | 15.03105 | 0.593593 | -3.008026 | -0.891109 | -3.194933 | 0.0 | 2.534137 | -2.69303 | 0.445302 | 0.728078 | 1.125047 | -7.193214 | 3.553140 | 9.074751 | -6.490028 | 5.500291 | 8.157917 | -0.030059 | 5.964388 | -3.710968 | 1.698598 | 0.714672 | 6.886047 | -5.772947 | -5.918946 | -2.266574 | -1.787131 | 0.792595 | 4.644785 | 4.455899 | 0.752975 | -4.119556 | -2.207668 | 1.832593 | -1.500165 | 1.506379 | 0.270154 | 14.308890 | 0.720426 | 0.0 | -2.595657 | -7.065150 | -2.122877 | -1.842586 | -4.184539 | -7.349354 | 0.606193 | 3.654931 | 3.315533 | -5.910056 | 2.977704 | 4.959506 | 1.995011 | -0.185785 | 5.139831 | -0.885318 | 9.191101 | 2.561203 | 15.349479 | -1.636278 | -2.910241 | 7.856007 | -7.674670 | 0.0 | -2.283461 | 3.570576 | -5.663034 | -4.254855 | -1.888036 | 0.996662 | 5.211629 | -5.311372 | -9.550293 
| -1.61197 | 8.770697 | -5.597959 | -0.913952 | -2.209253 | 0.373954 | -2.512585 | 2.155818 | 8.889506 | -0.381494 | 7.591595 | 1.207483 | -1.799127 | 5.578843 | -0.384514 | -1.933274 | 0.079942 | 2.742916 | 12.895015 | -0.547971 | -0.781189 | -0.947819 | 7.199613 | 2.463929 | -6.603588 | -5.546351 | 1.801852 | 0.621235 | -0.098132 | 8.675831 | -3.382672 | -0.566179 | -3.057798 | -0.311815 | 3.517655 | 7.535554 | 6.004415 | -0.233510 | 0.345563 | 3.969839 | 3.858742 | -7.550658 | 8.106568 | -2.641687 | 0.0 | 3.638349 | 0.390070 | 2.050581 | -0.349905 | -0.233236 | -2.284807 | -11.281781 | 4.228399 | 0.0 | -2.299169 | -3.356152 | -3.558097 | -1.856723 | 2.762214 | 0.0 | 4.241513 | -1.419888 | -5.984615 | -2.912093 | -8.371896 | -2.968857 | -0.755497 | 3.858434 | 0.769617 | 3.088135 | 4.830260 | -1.590939 | 0.156408 | -4.074446 | -2.037202 | 1.622219 | 5.566826 | -1.623508 | -0.125687 | -1.447550 | 2.762181 | 2.901881 |
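The `odds` and `perc_change_odds` values tabulated above are computed earlier in the notebook. Assuming they follow the usual logistic-regression convention, each comes from a fitted coefficient as sketched below. The coefficient values are hypothetical, chosen only to land near the `Family` and `Education` columns of the table.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients (change in log-odds per one-unit increase)
coefs = pd.Series({"Family": 0.118, "Education": 0.2244, "Online": -0.001})

odds = np.exp(coefs)                 # odds ratio per one-unit increase
perc_change_odds = (odds - 1) * 100  # the same information as a percentage

summary = pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}).T
print(summary.round(4))
```

An odds ratio above 1 raises the odds of the positive class and one below 1 lowers them, holding the other variables fixed. For example, `Family` at about 1.125 corresponds to roughly a 12.5% increase in odds per additional family member.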
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_train, y_train)
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_test, y_test)
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test
)
print("Test set performance:")
log_reg_model_test_perf
Test set performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
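Perfect scores on both the training and test sets are unusual and worth checking before going further. `Personal_Loan`, the variable the objective asks us to predict, appears among the `X_train` columns in the odds table, which suggests the target (or a close proxy for it) may have leaked into the predictors. Below is a minimal sketch of one such check on a toy frame. The helper `find_leaky_columns` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def find_leaky_columns(X, y, tol=1e-12):
    """Return binary (0/1) columns of X that reproduce y exactly, or its complement."""
    y = pd.Series(y).astype(float).reset_index(drop=True)
    leaky = []
    for col in X.columns:
        x = X[col].astype(float).reset_index(drop=True)
        if x.nunique() != 2:
            continue  # only a two-valued column can mirror a binary target
        same = (x - y).abs().max() <= tol
        flipped = (x - (1 - y)).abs().max() <= tol
        if same or flipped:
            leaky.append(col)
    return leaky


# Toy data: "Personal_Loan" duplicates the target, "Online" does not
X_toy = pd.DataFrame(
    {
        "Income": [40, 120, 85, 150, 60, 30],
        "Online": [1, 0, 1, 1, 0, 0],
        "Personal_Loan": [0, 1, 0, 1, 0, 0],
    }
)
y_toy = pd.Series([0, 1, 0, 1, 0, 0])

print(find_leaky_columns(X_toy, y_toy))  # ['Personal_Loan']
```

If a column like this turns up, the usual remedy is to drop it from the predictors before the train/test split and refit. A check of this kind only catches exact copies of the target, so a simple look at `X_train.columns` for the target's name is a useful complement.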
logit_roc_auc_train = roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.9101029805309264
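The cell above picks the threshold that maximizes `tpr - fpr`, a quantity known as Youden's J statistic. The sketch below applies the same rule to synthetic scores, which are an assumption made for illustration and not the notebook's data.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# Synthetic labels and scores: positives tend to score higher than negatives
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])
y_score = np.concatenate([rng.beta(2, 5, 200), rng.beta(5, 2, 50)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr            # Youden's J at every candidate threshold
best = np.argmax(j)
threshold = thresholds[best]

# Predictions at the chosen cut-off (roc_curve treats score >= threshold as positive)
y_pred = (y_score >= threshold).astype(int)
print(f"threshold={threshold:.3f}, TPR={tpr[best]:.3f}, FPR={fpr[best]:.3f}")
```

Because the threshold is tuned on the training data, the training metrics at that threshold are likely to be somewhat optimistic. A separate validation split would give a fairer estimate.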
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.999712 | 0.999331 | 1.0 | 0.999665 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.996644 | 0.992063 | 1.0 | 0.996016 |
y_scores = lg.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, hence [:-1]
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(20, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.94
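The value 0.94 is read off the plot by eye. One way to make that choice reproducible, assuming the intent is the point where the precision and recall curves meet, is sketched below on synthetic scores (not the notebook's data).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
# Synthetic labels and scores: positives tend to score higher than negatives
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])
y_score = np.concatenate([rng.beta(2, 5, 200), rng.beta(5, 2, 50)])

prec, rec, thr = precision_recall_curve(y_true, y_score)
# prec and rec have one more element than thr; drop the final (1, 0) point
gap = np.abs(prec[:-1] - rec[:-1])
idx = np.argmin(gap)     # threshold where precision and recall are closest
threshold_curve = thr[idx]

print(
    f"threshold={threshold_curve:.3f}, "
    f"precision={prec[idx]:.3f}, recall={rec[idx]:.3f}"
)
```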
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.997123 | 0.993311 | 1.0 | 0.996644 |
# training performance comparison
models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.91 Threshold",
    "Logistic Regression-0.94 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.91 Threshold | Logistic Regression-0.94 Threshold |
|---|---|---|---|
| Accuracy | 1.0 | 0.999712 | 0.997123 |
| Recall | 1.0 | 0.999331 | 0.993311 |
| Precision | 1.0 | 1.000000 | 1.000000 |
| F1 | 1.0 | 0.999665 | 0.996644 |
Try pruning technique(s) - Evaluate the model on appropriate metric - Comment on model performance
The model can make two kinds of wrong predictions:
1. Predicting that a customer will buy a personal loan when in reality the customer does not (false positive): waste of campaign funds.
2. Predicting that a customer will not buy a personal loan when in reality the customer would (false negative): loss of opportunity for the campaign.
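The cells that follow score the decision tree by recall, which penalizes missed buyers (the second case above). The sketch below uses made-up labels to show how both error types and recall fall out of a confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = bought the loan, 0 = did not
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("false positives (wasted campaign spend):", fp)  # 1
print("false negatives (missed buyers):", fn)          # 1
print("recall:", recall_score(y_true, y_pred))         # 0.75
print("precision:", precision_score(y_true, y_pred))   # 0.75
```

Recall is `tp / (tp + fn)`, so it falls whenever a real buyer is missed. Precision is `tp / (tp + fp)`, so it falls whenever campaign funds go to a customer who does not buy.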
## Function to calculate recall score
def get_recall_score(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    prediction = model.predict(predictors)
    return recall_score(target, prediction)
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = get_recall_score(model, X_train, y_train)
print("Recall Score:", decision_tree_perf_train)
Recall Score: 1.0
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = get_recall_score(model, X_test, y_test)
print("Recall Score:", decision_tree_perf_test)
Recall Score: 1.0
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- CreditCard <= 0.50
|   |--- Securities_Account <= 0.50
|   |   |--- Personal_Loan <= 0.50
|   |   |   |--- weights: [297.15, 0.00] class: False
|   |   |--- Personal_Loan > 0.50
|   |   |   |--- ZIP_City_San Ramon <= 0.50
|   |   |   |   |--- weights: [0.00, 173.40] class: True
|   |   |   |--- ZIP_City_San Ramon > 0.50
|   |   |   |   |--- weights: [0.00, 0.85] class: True
|   |--- Securities_Account > 0.50
|   |   |--- ZIP_City_Laguna Hills <= 0.50
|   |   |   |--- weights: [0.00, 221.85] class: True
|   |   |--- ZIP_City_Laguna Hills > 0.50
|   |   |   |--- weights: [0.00, 0.85] class: True
|--- CreditCard > 0.50
|   |--- weights: [0.00, 873.80] class: True
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(50,50))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -7.379968e-14 |
| 1 | 3.516551e-15 | -7.028313e-14 |
| 2 | 6.786211e-15 | -6.349692e-14 |
| 3 | 1.024019e-01 | 3.072056e-01 |
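The first three alphas (on the order of 1e-15) and the slightly negative impurities are most likely floating-point round-off rather than meaningful pruning levels, since Gini impurity cannot be negative. A minimal sketch of collapsing such values before fitting one tree per alpha follows. The tolerance `tol` is an assumed value, not something the notebook defines.

```python
import numpy as np

# Values copied from the pruning-path table above
ccp_alphas = np.array([0.0, 3.516551e-15, 6.786211e-15, 1.024019e-01])

# Treat anything below a small tolerance as zero, then drop duplicates
tol = 1e-10  # assumed tolerance; adjust to the scale of the problem
cleaned = np.unique(np.where(ccp_alphas < tol, 0.0, ccp_alphas))
print(cleaned)
```

After cleaning, only two distinct alphas remain, and the larger one prunes the tree down to its root. The pruning search therefore has essentially one candidate to choose from on this data.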
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.10240186673580605
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas,
    recall_train,
    marker="o",
    label="train",
    drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test recall (np.argmax returns the first such index)
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
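`np.argmax` returns the first index of the maximum, so when several alphas tie on test recall the smallest alpha (the least-pruned tree) wins. That is what happens above, where the selected model shows the default `ccp_alpha`. If the simplest tree among the tied candidates is preferred, which is an assumption about intent, the last maximal index can be taken instead. The recall values below are hypothetical.

```python
import numpy as np

# Hypothetical recall values for trees ordered by increasing ccp_alpha
recall_test = [1.0, 1.0, 1.0, 0.82]

# First occurrence of the maximum: the least-pruned tree
first_best = int(np.argmax(recall_test))

# Last occurrence of the maximum: search the reversed list and map the index back
last_best = len(recall_test) - 1 - int(np.argmax(recall_test[::-1]))

print(first_best, last_best)  # 0 2
```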
best_model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(best_model, X_train, y_train)
print("Recall Score:", get_recall_score(best_model, X_train, y_train))
Recall Score: 1.0
#### checking performance on test set
confusion_matrix_sklearn(best_model, X_test, y_test)
print("Recall Score:", get_recall_score(best_model, X_test, y_test))
Recall Score: 1.0
### Visualizing the Decision Tree
plt.figure(figsize=(5, 5))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- CreditCard <= 0.50
|   |--- Securities_Account <= 0.50
|   |   |--- Personal_Loan <= 0.50
|   |   |   |--- weights: [297.15, 0.00] class: False
|   |   |--- Personal_Loan > 0.50
|   |   |   |--- ZIP_City_San Ramon <= 0.50
|   |   |   |   |--- weights: [0.00, 173.40] class: True
|   |   |   |--- ZIP_City_San Ramon > 0.50
|   |   |   |   |--- weights: [0.00, 0.85] class: True
|   |--- Securities_Account > 0.50
|   |   |--- ZIP_City_Laguna Hills <= 0.50
|   |   |   |--- weights: [0.00, 221.85] class: True
|   |   |--- ZIP_City_Laguna Hills > 0.50
|   |   |   |--- weights: [0.00, 0.85] class: True
|--- CreditCard > 0.50
|   |--- weights: [0.00, 873.80] class: True